Estimating False Negatives for Classification Problems with Cluster Structure

Authors

György J. Simon

Vipin Kumar

Zhi-Li Zhang

Abstract

Estimating the number of false negatives for a classifier when the true outcome of the classification is ascertained only for a limited number of instances is an important problem, with a wide range of applications from epidemiology to computer/network security. The frequently applied method is random sampling. However, when the target (positive) class of the classification is rare, which is often the case with network intrusions and diseases, this simple method results in excessive sampling. In this paper, we propose an approach that exploits the cluster structure of the data to significantly reduce the amount of sampling needed while guaranteeing an estimation accuracy specified by the user. The basic idea is to cluster the data and divide the clusters into a set of “strata”, such that the proportion of positive instances in the stratum is very low, very high or in between, respectively. By taking advantage of the different characteristics of the strata, more efficient estimation strategies can be applied, thereby significantly reducing the amount of required sampling. We also develop a computationally efficient clustering algorithm – referred to as class-focused partitioning – which uses the (imperfect) labels predicted by the classifier as additional guidance. We evaluated our method on the KDDCup network intrusion data set. Our method achieved better precision and accuracy with a 5% sample than the best trial of simple random sampling with 40% samples.
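The stratification idea in the abstract can be illustrated with a minimal sketch. This is a generic stratified-sampling estimator, not the authors' algorithm: it omits class-focused partitioning and the user-specified accuracy guarantee. The names `stratified_fn_estimate`, `strata`, `sample_fraction`, and `oracle` are hypothetical and chosen only for illustration.

```python
import random


def stratified_fn_estimate(strata, sample_fraction, oracle, rng):
    """Estimate the number of false negatives among predicted-negative instances.

    strata          -- list of lists of instance ids, all predicted negative
                       (hypothetical input; the paper derives strata from clusters)
    sample_fraction -- fraction of each stratum whose true label is checked
    oracle          -- function id -> bool, True if the instance is truly positive
    rng             -- random.Random instance used for sampling
    Returns (estimated number of false negatives, number of labels requested).
    """
    estimate = 0.0
    labeled = 0
    for stratum in strata:
        if not stratum:
            continue
        # Check at least one instance per stratum, never more than its size.
        n = min(len(stratum), max(1, round(sample_fraction * len(stratum))))
        sample = rng.sample(stratum, n)
        positives = sum(1 for i in sample if oracle(i))
        # Scale the sampled positive rate up to the size of the stratum.
        estimate += len(stratum) * positives / n
        labeled += n
    return estimate, labeled


# Toy data (assumption): ids 0..99 are true negatives, ids 100..109 are
# positives that the classifier missed, and the strata happen to be pure.
mostly_negative = list(range(100))
mostly_positive = list(range(100, 110))


def is_positive(i):
    return i >= 100


est, cost = stratified_fn_estimate(
    [mostly_negative, mostly_positive], 0.1, is_positive, random.Random(0)
)
print(est, cost)
```

When a stratum is (nearly) pure, a handful of checked instances already pins down its positive rate, which is why grouping the data this way can need far fewer labels than simple random sampling over the whole data set.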


Similar resources

Modeling and design of a diagnostic and screening algorithm based on hybrid feature selection-enabled linear support vector machine classification

Background: In the current study, a hybrid feature selection approach involving filter and wrapper methods is applied to several bioscience databases with various records, attributes, and classes; this strategy therefore enjoys the advantages of both methods, such as fast execution, generality, and accuracy. The purpose is to diagnose disease status and estimate patient survival. Method...


A new method for correcting verification bias in diagnostic test accuracy studies using a Bayesian approach

Background & Objectives: One of the problems of diagnostic accuracy studies is verification bias. It occurs when the standard test is performed only for a non-representative subsample of the study subjects who underwent the diagnostic test. In this study, we extend a Bayesian method to correct this bias. Methods: Patients who had at least two repeated failures in IVF/ICSI cycles were included i...


Voxel-based morphometry 1113 elements

An underlying assumption of the above parametric approach is that the process is a Gaussian field, i.e., its statistical characteristics, including its roughness parameter (or its reciprocal, the smoothness), are the same at each point in the image. The FWHM of the process should be constant in all directions and across all voxels in the image. While these assumptions are reasonable for fu...


Estimating the prevalence of atrial fibrillation in a general population using validated electronic health data

BACKGROUND: The purpose of this study was to determine the prevalence of atrial fibrillation (AF) in the general population and to validate an administrative diagnosis register, i.e., the National Patient Register (NPR), and an electrocardiography (ECG) database in estimating disease prevalence. METHODS: The study was conducted in a well-defined region in northern Sweden (population n=75,945) whi...


Theory of Optimizing Pseudolinear Performance Measures: Application to F-measure

State-of-the-art classification algorithms are designed to minimize the misclassification error of the system, which is a linear function of the per-class false negatives and false positives. Nonetheless, non-linear performance measures are widely used for the evaluation of learning algorithms. For example, the F-measure is a commonly used non-linear performance measure in classification problems. ...



Publication date: 2007
